Skip to content

feat(engine): wire extensions and capabilities into runtime pipeline#2860

Open
gouslu wants to merge 24 commits into
open-telemetry:mainfrom
gouslu:gouslu/extension-system-p1-pr4
Open

feat(engine): wire extensions and capabilities into runtime pipeline#2860
gouslu wants to merge 24 commits into
open-telemetry:mainfrom
gouslu:gouslu/extension-system-p1-pr4

Conversation

@gouslu
Copy link
Copy Markdown
Contributor

@gouslu gouslu commented May 6, 2026

Change Summary

Part 4 of the Extension System (P1) series. Wires the previously
landed Capability Registry & Resolver (#2732) into the runtime
pipeline so extensions are actually instantiated, started, and
shut down by the engine, and so consumer nodes can resolve their
capability bindings at build time.

Highlights:

  • Runtime wiring in runtime_pipeline.rs: extension lifecycle
    is invoked before any data-path node is spawned, and Shutdown
    is delivered to extensions only after the data path drains
    ("started first, shut down last"). Active and passive extensions
    are handled separately; failures abort startup cleanly.
  • Local capability ownership aligned with shared via a
    Box-clone factory pattern, removing the prior asymmetry between
    the two trait variants.
  • Two reference test capabilities under
    crates/engine/src/capability/: NoOpStateless and
    NoOpStateful. They exercise every codegen path of the
    #[capability] proc macro (&self × {sync, async}, &mut self
    × {sync, async}, borrowed/owned returns, etc.).
  • Comprehensive end-to-end test suite at
    crates/engine/tests/extension_e2e.rs (26 tests) covering:
    passive/active/background extensions, lifecycle ordering and
    shutdown ordering, fail-fast on extension errors, dual-variant
    pruning, one-shot capability enforcement (all accessor
    combinations), shared mutable state across consumers via
    Arc/Rc for both local and shared trait variants, async
    &mut self invocation through boxed handles, and active
    extensions mutating shared state observed by capability
    consumers.
  • Architecture doc updated with a precise statement of the
    start-first/shut-down-last invariant (it orders lifecycle
    calls, not init completion) and a noted future consideration
    to add an opt-in readiness probe if/when an extension needs an
    init-complete guarantee.
  • URN unification: extension URNs now use the canonical
    4-segment form urn:<namespace>:extension:<id> (mirroring the
    receiver/processor/exporter convention), with a short form
    extension:<id>. The shared parser core lives in a new
    private crates/config/src/urn.rs; node_urn.rs and
    extension_urn.rs delegate to it with disjoint accepted-kind
    sets so the two URN types cannot be confused. As a consequence,
    NodeKind::Extension and the now-unreachable
    Error::ExtensionInNodesSection are removed. Misplacement
    errors include actionable hints (e.g. "declare under
    extensions: instead of nodes:"
    ).
  • All in-tree node factories (receivers, processors, exporters
    in core-nodes and contrib-nodes) updated to accept the new
    &Capabilities parameter; existing factories that don't depend
    on any capability simply ignore it.

What issue does this PR close?

How are these changes tested?

  • New extension_e2e.rs integration test (26 tests) exercises the
    wiring end-to-end against synthetic receivers/processors/
    exporters/extensions.
  • New unit tests in urn.rs cover the shared parser core and the
    misplacement-error hints; existing extension_urn and
    node_urn tests updated to assert the canonical 4-segment form.
  • Pipeline-level regression tests cover rejecting extension URNs
    in the nodes: section and node URNs in the extensions:
    section.
  • cargo xtask check (structure check + fmt + clippy --workspace --all-targets -- -D warnings + cargo test --workspace) passes
    cleanly. No new clippy warnings.

Are there any user-facing changes?

Yes:

  • Extension URN format: extension URNs now use
    urn:<namespace>:extension:<id> (4-segment) instead of the
    pre-existing 3-segment urn:<namespace>:<id>. Short form
    extension:<id> (expands to urn:otel:extension:<id>) is
    available as a developer convenience. Existing 3-segment
    extension URNs in pipeline configs must be updated; the bundled
    configs/fake-with-extension.yaml shows the new shape.
  • New extension authoring surface: Extension trait,
    ExtensionWrapper::builder typestate, the
    extension_capabilities! macro, and the test capabilities
    NoOpStateless / NoOpStateful are now reachable for
    external extension authors. The architecture doc captures the
    lifecycle contract.
  • Node factory signature now includes &Capabilities as a
    parameter; existing custom factories will need to accept (and
    may ignore) this new argument

gouslu and others added 3 commits May 5, 2026 20:15
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extension and node URNs now share one canonical 4-segment shape
`urn:<namespace>:<kind>:<id>` (e.g., `urn:microsoft:extension:azure_identity_auth`),
mirroring the receiver/processor/exporter convention. Short forms
`<kind>:<id>` work for both, expanding to `urn:otel:<kind>:<id>`.

The previous 3-segment extension form (`urn:<ns>:<id>`) is no longer
accepted. Misplacement errors include actionable hints, e.g.
'(declare under `extensions:` instead of `nodes:`)'.

- New private `crates/config/src/urn.rs` factors out the shared
  parser core (`parse_kinded_urn`, `build_canonical_urn`, segment
  validators, misplacement-hint formatter).
- `node_urn.rs` and `extension_urn.rs` now delegate to it; their
  accepted-kind sets are disjoint, so `NodeUrn` can never parse an
  extension URN and vice versa.
- Removed `NodeKind::Extension` from the node enum and all defensive
  match arms in `pipeline.rs`, `controller/lib.rs`,
  `controller/startup.rs`, `engine/lib.rs`.
- Removed `Error::ExtensionInNodesSection` (became unreachable —
  type-rejected at parse).
- Updated YAML fixtures, bundled `configs/fake-with-extension.yaml`,
  and all `extension_e2e.rs` test URN constants to the 4-segment form.
- Added regression tests for misplacement-hint error messages.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot added the rust Pull requests that update Rust code label May 6, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 85.99034% with 116 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.09%. Comparing base (3e94f1f) to head (e56d2f3).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2860      +/-   ##
==========================================
- Coverage   86.26%   86.09%   -0.18%     
==========================================
  Files         715      722       +7     
  Lines      272123   273773    +1650     
==========================================
+ Hits       234751   235708     +957     
- Misses      36848    37541     +693     
  Partials      524      524              
Components Coverage Δ
otap-dataflow 87.26% <85.99%> (+0.03%) ⬆️
query_abstraction 80.61% <ø> (ø)
query_engine 89.57% <ø> (-1.16%) ⬇️
otel-arrow-go 52.45% <ø> (ø)
quiver 92.25% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

gouslu and others added 5 commits May 6, 2026 12:58
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gouslu gouslu marked this pull request as ready for review May 8, 2026 16:48
@gouslu gouslu requested a review from a team as a code owner May 8, 2026 16:48
gouslu and others added 5 commits May 8, 2026 10:12
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… validation

The file contains Jinja placeholders ({{core_start}}, {{core_end}}) but
was missing the .j2 suffix, so the validate-configs CI script picked it
up as a real config and failed YAML parsing. Sibling otlp-otlp.yaml.j2
and otlphttp-otlphttp.yaml.j2 already use the .j2 convention; align
otap-otap with them and update the three suite references.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change from Box<> to Rc<> occupies most of this PR, makes sense.

Comment thread rust/otap-dataflow/crates/engine/src/capability/registry/tests.rs Outdated
Comment thread rust/otap-dataflow/crates/engine-macros/src/capability.rs
Comment thread rust/otap-dataflow/crates/engine/src/capability/registry/tests.rs
Comment thread rust/otap-dataflow/crates/engine/tests/extension_e2e.rs
Comment thread rust/otap-dataflow/crates/engine/src/runtime_pipeline.rs Outdated
gouslu and others added 5 commits May 12, 2026 09:29
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread rust/otap-dataflow/docs/extension-system-architecture.md Outdated
Comment thread rust/otap-dataflow/crates/engine/src/runtime_pipeline.rs Outdated
Comment thread rust/otap-dataflow/crates/engine/src/runtime_pipeline.rs Outdated
Comment thread rust/otap-dataflow/crates/controller/src/startup.rs
Comment thread rust/otap-dataflow/configs/fake-with-extension.yaml Outdated
#[test]
fn accepts_ref_self_with_lifetime() {
assert!(validate("trait Cap { fn get<'a>(&'a self) -> &'a str; }").is_none());
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for method type generics too? Since the generated handles are Box<dyn ...>, I’m not sure fn get<T>(&self, key: T) is actually dyn-compatible.

Comment thread rust/otap-dataflow/crates/engine-macros/src/capability.rs
Comment thread rust/otap-dataflow/docs/extension-system-architecture.md
gouslu and others added 2 commits May 12, 2026 14:20
Address review feedback on PR open-telemetry#2860:

- Reject method-level generic type and const parameters in
  validate_trait with a clear syn::Error. The macro generates
  Box<dyn local/shared::Trait> handles, and dyn-compatibility forbids
  generic type/const params on dispatchable methods. Lifetimes are
  unaffected and remain accepted. Previously the trait would expand
  and only fail later with a deep E0038 error.
  Refs: open-telemetry#2860 (comment)

- Reject destructured argument patterns (e.g. `(a, b): (u64, u64)`)
  in validate_trait with a syn::Error instead of a proc-macro panic
  deep in adapter codegen. The leftover defensive check in the
  catches this up-front.
  Refs: open-telemetry#2860 (comment)

- Update module docs to correct the misleading claim that method-level
  generics are supported, and document why generic type/const params
  on methods can't work with the Box<dyn Trait> handles.

- Add tests covering the new validation paths (lifetime accepted,
  generic type rejected, generic const rejected, destructured
  parameter rejected).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address review feedback on PR open-telemetry#2860: the requirements doc described
`shared` as "one runtime instance shared across cores", which conflicts
with the architecture doc's correct statement that pipeline-scoped
shared extensions are still instantiated per pipeline instance (per
core). Reconcile both docs around the same model:

- Execution model (`local` / `shared`) defines the *type constraints*
- Sharing boundary is determined by the *scope* at which the extension
  is declared. In Phase 1 the only scope is pipeline, so `shared`
  extensions are per-pipeline-instance (per core), with the `Send +
  Clone` bounds enabling true cross-core sharing later when group /
  engine scopes land.

Changes:

- extension-requirements.md: rewrite the execution-model table to
  surface type constraints and Phase 1 sharing boundary explicitly,
  add a Phase 1 callout for `shared`, and reframe the surrounding
  prose so it no longer reads as "shared = cross-core in Phase 1".
- extension-system-architecture.md: append a forward pointer in the
  thread-per-core requirement row so a table-only reader sees the
  per-core caveat already documented further down.

Refs: open-telemetry#2860 (comment)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
gouslu added a commit to gouslu/otel-arrow that referenced this pull request May 12, 2026
…bounded drain

Address review feedback on PR open-telemetry#2860:

- runtime_pipeline.rs: refactor the run loop into an inner async
  block whose Result is captured, so cleanup runs on every exit
  path ("async finally"). Previously, on any error the loop did
  `return Err(e)` immediately and dropped already-started
  extensions without sending Shutdown — extensions owning sockets,
  files, or background work missed their cleanup signal. Now the
  outer block always calls broadcast_shutdown() (idempotent on the
  normal path) and drain_until_deadline().await before propagating
  the loop's result. Behavior on the normal path is unchanged.
  Refs: open-telemetry#2860 (comment)

- extension_lifecycle.rs: add drain_until_deadline() that waits for
  remaining active+background extension tasks to finish but never
  past the broadcast deadline (plus a small slack to absorb the
  ControlChannel adapter's deferred Shutdown delivery and the
  task-return / JoinHandle-observed latency). A misbehaving
  extension that ignores Shutdown can no longer hang the pipeline
  indefinitely — it is dropped after EXTENSION_SHUTDOWN_GRACE +
  EXTENSION_SHUTDOWN_DRAIN_SLACK with a warning. broadcast_shutdown
  now also stashes the deadline so the drain stays consistent with
  what the wire-level message advertised.
  Refs: open-telemetry#2860 (comment)

- extension_e2e.rs: regression test
  test_other_extensions_receive_shutdown_when_pipeline_errors —
  configures one failing extension and one shutdown-recording
  extension (each bound by its own probe receiver to survive
  defined-but-unbound pruning). When the failing extension's
  start() errors, the pipeline aborts but the recording extension
  must still observe Shutdown before run_forever returns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…bounded drain

On error, the run loop now flows through an "async finally" block that
broadcasts Shutdown to remaining extensions and bounded-drains them
before propagating the error, instead of dropping live extensions
mid-run. The drain is capped at EXTENSION_SHUTDOWN_GRACE + a small
slack so a misbehaving extension can't hang the pipeline. Adds a
regression test covering the error path.

Refs: open-telemetry#2860 (comment)
Refs: open-telemetry#2860 (comment)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gouslu gouslu force-pushed the gouslu/extension-system-p1-pr4 branch from bb54448 to d17b453 Compare May 12, 2026 22:55
gouslu and others added 3 commits May 12, 2026 16:04
Mirror the existing per-node validation pass for extensions in
`validate_pipeline_components`. With `NodeKind::Extension` removed,
unknown extension URNs and bad extension configs were no longer
caught at startup and only surfaced at runtime when the engine tried
to resolve them. Now they fail fast with the same kind of
`Unknown ... component` / `Invalid config for ...` message we
already give for nodes.

Adds a unit test covering the unknown-URN path.

Refs: open-telemetry#2860 (comment)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The config referenced `urn:otap:extension:sample_shared_key_value_store`,
which has no registered ExtensionFactory anywhere in the binary. Until
this PR's earlier validation tightening (f9a67ea) the validate-configs
CI job missed it because extension URNs weren't walked. With the fix in
place, this YAML now fails validate-configs.

It also has no test/script/doc consumers — only the validate-configs
sweep was loading it. The schema is still demonstrated by the
`test_extension_with_config_and_capabilities` unit test in
crates/config/src/pipeline.rs, so deleting the orphan loses no
documentation value. A runnable demo extension can land in a follow-up
alongside a real factory.

Refs: open-telemetry#2860 (comment)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a "Runtime cost" sub-paragraph under the existing "Extensions
start first, shut down last" key design decision in
extension-system-architecture.md. Spells out that start() (and any
capability method dispatched on that runtime) shares the per-core
async runtime with the data path, so blocking I/O or CPU-heavy work
in those bodies starves the pipeline on that core. Recommends
spawn_blocking, a worker thread, or Rayon for off-runtime work. Same
guidance applies to background extensions.

Refs: open-telemetry#2860 (comment)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rust Pull requests that update Rust code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants